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Abstract 

Protein is the working molecule of cell, and evolution is the hallmark of life. It is important to 
understand how protein folding and evolution influences each other. Several studies correlating experi- 
mental measurement of residue participation in folding nucleus and sequence conservation have reached 
different conclusions. These studies are based on assessment of sequence conservation at folding nucleus 
sites using entropy or relative entropy measurement derived from multiple sequence alignment. Here 
we report analysis of conservation of folding nucleus using an evolutionary model alternative to entropy 
based approaches. We employ a continuous time Markov model of codon substitution to distinguish 
mutation fixed by evolution and mutation fixed by chance. This model takes into account bias in codon 
frequency, bias favoring transition over transversion, as well as explicit phylogenetic information. We 
measure selection pressure using the ratio u) of synonymous vs. non-synonymous substitution at individ- 
ual residue site. The w-values are estimated using the Paml method, a maximum-likelihood estimator. 
Our results show that there is little correlation between the extent of kinetic participation in protein 
folding nucleus as measured by experimental 0-value and selection pressure as measured by w-value. 
In addition, two randomization tests failed to show that folding nucleus residues are significantly more 
conserved than the whole protein, or the median u> value of all residues in the protein. These results 
suggest that at the level of codon substitution, there is no indication that folding nucleus residues are 
significantly more conserved than other residues. We further reconstruct candidate ancestral residues 
of the folding nucleus and suggest possible test tube mutation studies for testing folding behavior of 
ancient folding nucleus. 
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Are amino acid residues important for rapid folding preferentially conserved during evolution? Does 
natural selection optimize proteins for folding kinetics? If protein folding involves initially the formation 
of a small region of native-like folding nucleus, are identities of these residues well conserved during 
evolution [1-5]? These fundamental questions of molecular biology have received much attention [1-12]. 
Of direct relevance are experimental 0- value studies, which provide information about the role individual 
amino acid residue play in the formation of folding nucleus [13-15]. By measuring the change A AG 
in protein stability and the change AAG* in folding barrier due to mutation of an amino acid residue, 
Rvalue (defined as = AAG^/AAG) for the mutated residue can be calculated. Rvalue has been used 
to measure the extent to which side chain of a mutated residue participates in native- like interactions. 
A <p- value of 0.0 indicates that the site of mutation is as unfolded as in the denatured state. A 0- value 
of 1.0 indicates that the site of mutation is as folded as in the native state, i.e., this residue is involved 
in native-like transition state structure, and is part of the folding nucleus. 0-value between and 1 is 
interpreted as possessing different degrees of structure in transition state [13]. The folding nucleus can 
be identified as formed by the set of residues with 4> values above a threshold {e.g., <fi > 0.5) [13]. Several 
computational methods have been developed for predicting protein folding mechanism and ^-values of 
residues. These include the sequential binary collision model [16], multisegment model [17], and singlc- 
to-triple sequence approximation model [18]. Model conformations of transition-state ensemble have also 
been generated explicitly by Monte Carlo sampling using Go-type potential derived from experimental 
</>-values constraints [19]. A lucid statistical mechanistic picture for understanding 0- value experiments 
can be found in [20, 21]. 

The evolutionary conservation of folding nucleus residues are the subject of several recent studies. 
These studies, however, have come to different conclusions. Plaxco et al and Larson et al showed that 
there may be little correlation between sequence conservation and participation in the folding transition 
state [8,9]. Mirny, Shahknovich and others demonstrated that for rapid folding, sequence identity of 
folding nucleus are more conserved within protein families and across protein supcrfamilies [2,7]. It is 
unclear whether the disagreement between these studies is due to the difference in entropy calculations 
as attributed in [7], or differences in choice and processing of the data set, in sequence alignments, in 
definition of folding nucleus, as well as intrinsic sample bias in Rvalue analysis, as discussed in details 
in [9]. 

In this study, we examine the conservation of folding nucleus residues using an approach that differs 
from previous studies in several aspects. First, instead of studying amino acid residue sequences, we 
examine the evolution of corresponding coding DNA sequences at codon level. Second, we use an explicit 
codon evolutionary model based on continuous time Markov process, which has yielded deep insights 
about the mechanisms of molecular evolution [22-24]. Instead of using entropy or relative entropy as 
quantitative measure of sequence conservation, we assess the ratio of mutation rates of synonymous 
vs. non-synonymous changes to detect natural selection at each amino acid residue position. Third, a 
phylogenetic tree is built to encode the closeness between proteins. Following previous studies [25,26], 
we use maximum likelihood method developed in [27] by Yang to estimate values of parameters of the 
evolutionary model and draw inference about the conservation of folding nucleus residues. 

We find experimental 0-values are not correlated with evolutionary conservation for seven proteins 
studied here. In addition, results using two statistical tests indicate that except possibly one protein, 
none of these proteins have folding nucleus more conserved than the rest of the proteins, or than the 
residue with median selection pressure. We have also reconstructed candidate ancestral folding nucleus 
residues, and have suggested exploratory test-tube mutation studies on the evolution of protein folding 
dynamics. 

Synonymous and nonsynonymous codon substitution. Protein sequences diverge from a 
common ancestor because mutations occur. Some fraction of these mutations are fixed into the evolving 
population by selection and some are fixed by chance, resulting in the substitution of one nucleotide for 
another nucleotide at various locations. Because evolution occurs at DNA level rather than at amino 
acid level, models of protein evolution based on codon usage are appealing and have been widely used 
[25,28-30]. In this study, we therefore consider substitutions at the codon level. A codon substitution 
can have two different outcomes for the nucleotide sequence of protein coding region: synonymous 
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substitution does not change the encoded sequence of amino acid residues, whereas nonsynonymous 
substitution leads to changes in the amino acid residues. Random mutation and selection pressure will 
have different effects on the rate of these two types of substitutions [31-33], and this difference can be 
exploited for detecting selective pressure at protein level [25,34-38]. Our key problem is to find out the 
ratio of the synonymous substitution rate d s and the nonsynonymous substitution rate d n . That is, we 
wish to estimate the ratio of the numbers of synonymous and non-synonymous substitutions at a specific 
site or a specific amino acid residue position. If natural selection offers no advantage, non-synonymous 
mutations will have the same rate as synonymous mutations (d n = d s ), and the ratio lu = d n /d s will 
be 1. If non-synonymous mutations are harmful, deleterious or lethal, purifying selection is at play and 
the rate for non-synonymous mutation will be reduced: we have d n < d s and u> < 1. On the other 
hand, if Darwinian positive selection favors non-synonymous mutation, we have d n > d s and u> > 1. 
Here u> is used as a measure of selection pressure. Substitution fixed by evolution and substitution 
fixed by chance are distinguished by examining the ratio u at various locations of amino acid residues. 
This technique has been frequently applied in studies of molecular evolution, e.g., in detecting adaptive 
evolution [22,38,39]. 

Continuous time Markov process for codon substitution. Markov model has been widely 
used in sequence analysis [40] and in evolutionary models [23]. In the current model, the outcome of 
codon substitution is determined only by the identity of codon in the ancestral sequence separated by 
divergence time t, and a codon transition probability matrix P(t). A phylogenetic tree is a key ingredient 
of this model. The topology and branch lengths of the tree reflects the evolutionary relationship among 
different proteins, which can model their closeness [23]. We follow the approach of [22,26,41], and 
describe below briefly the model. 

For a given phylogenetic tree, the parameters of the evolutionary model are a 61 x 61 rate matrix Q 
for 61 non-stop codons and the sequence divergence time ts (or the branch lengths) of the phylogenetic 
tree. The divergence time represents expected number of changes between sequences which are nodes 
in a phylogenetic tree. The entries Qij of matrix Q are infinitesimal substitution rates of nucleotides for 
the set C of 61 non-stop codons, and they are parametrized as: 



where \x is the basis rate, k is the transition/transversion rate ratio, ui the ratio of nonsynonymous and 
synonymous rates, and iTj is the codon frequency, which can be estimated as observed codon frequency 
in the sequences. In this model, the 61 x 61 rate matrix Q is fully determined by two parameters k and 
ui, since ttj can be estimated and fi is a constant [25,26]. 

For continuous time Markov process, the transition probability matrix of size 61 x 61 after time t is 
[24]: 

P(t) = {py(t)}=exp(Q-t) 

The entry Pij(t) represents the probability that codon i will mutate into codon j after time t. It is 
calculated through diagonalization of the Q matrix. 

ui ratio from likelihood of phylogeny. For node i and node j in a phylogenetic tree separated 
by divergence time Uj , the time reversible probability of observing nucleotide xt in a position h at node 
i and nucleotide Xj of the same position at node j is: 

f^XiPxiXj i^ij) — ^XjPxjXi(iij^) • (1) 

For a set S of s multiple-aligned sequences with n amino acid residues, we assume that a reasonably 
accurate phylogenetic tree T = (V, £) is given. Here V is the set of nodes (or vertices), namely, the union 
of the set of observed s sequences C (leaf nodes), and the set of s — 2 ancestral sequences X (internal 
nodes). £ is the set of edges (or branches) of the tree. Let the vector Xh = (x\, ■ ■ ■ , x s ) T be the observed 
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codons at position h for the s sequences. Without loss of generality, we assume that the root of the 
phylogenetic tree is an internal node k. Given the specified topology of the phylogenetic tree T and 
the set of branch lengths (or divergence times), and if the set of codons Cx of all internal nodes X is 
specified, the probability of observing the s number of codons Xh at position h is: 

p(x h \C x ,T) = TT Xk Y[ PxiXjiUj). 

Summing over the set C of all possible codons for the internal nodes I, we have 

P (x h \T) = n k J2 II ( 2 ) 

The probability of observing all codons in the coding region of the nucleotide sequences is: 

s 

P(S\T) = P( Xl , ■■■ , x s \T) = H P (x h \T). 

h=i 

To account for the possibility that the rate of nonsynonymous substitution can vary among different 
sites, the model developed in [41] allows M possible different classes of nonsynonymous substitutions 
with rates w\, ■ ■ ■ Each amino acid site falls into the M class with probabilities pi, ■ ■ ■ ,pm [41]. 

The probability of observing Xh is then modified from Equation (2), which gives p(xh\u) m ,T), to the 
following: 

M 

p(x h \T) =^2p m -p{x h \uj m ,T). 

m—1 

Repeat this calculation over all amino acid residue sites, we have 

s 

P(S\T) = l[p{x h \T), 

and the likelihood function is: 

s 

£(T)) = ^og\p(x h \T)}. 
h=l 

To estimate the parameters Kh, lo^ for each site h used in the mutation rate matrix Q, we use a 
Maximum Likelihood Estimator [26,37,42], the Paml package by Yang [27]. Our goal is to search for 
parameters n h and uj h such that the likelihood function t(T) is maximized. Here the number M of 
different classes of ui is 10, and they take the default values as assigned by Paml [41]. 

Once the model parameters are estimated, the empirical Bayes approach can be used to infer the 
most likely class of w value at each residue site [22]. In Paml, the posterior probability p(ui m \xh) that 
site h with observed codons Xh is from class m with rate ratio cj m is calculated as: 

p(ui m \x h ) =p m ■p(x h \u) m ,T)/p(x h \T) =p m ■p(x h \u) m ,T)/'^2p m ■ p(x h \ui m ,T). 

m 

Data collection and computational procedures. We follow [7] and study evolution of the set 
of proteins taken from Table 1 of [7], where the folding nucleus residues are defined. We first query with 
the sequence of each of the proteins against HSSP database [43] to obtain homologous protein sequences 
with overall sequence identity > 30% to ensure that they have the same fold. In some cases, we also 
searched the Ce server [44] for structural homologs. Experimentation using Psi-BLAST searching the 
NR-database of protein sequences give almost identical sets of sequences. In this study, all redundant 
sequences are removed. Since paralogous sequences in a single species may exist that can be matched to 
the query DNA sequence, we only take the sequence with the highest identity to the query protein when 
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FKBP12 (1fki) 




GVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVI RGWE EGVAQMSVGQRAKLT I SPDYAYGATGHPG I IPPHATL 



Residue 

Figure 1: Selection pressure as measured by u ratio of nonsynonymous vs. synonymous codon substitution rate varies 
at each amino acid residue site along the sequence of protein FKBP12. The ten possible uj values are grouped into three 
classes: uj a < 0.12 (dark), 0.12 < uib < 0.34 (gray), and 0.34 < uj c (light). The x-axis shows the residue number of 
the protein, the y-axis shows the posterior probability of to belonging to one of the three classes at each codon position. 
Residues with large probability for ui a (dark) are highly conserved residues experiencing strong purifying pressure. Folding 
nucleus residues as identified in [7] are marked by the symbol "*". 

multiple homologous sequences are found in a single species. With the exception of protein CI2 where 
sequences of two paralogs are included, only proteins with > 5 known orthologous DNA sequences are 
kept. We therefore exclude AcP protein and CD2.dl protein because fewer than 5 DNA sequences were 
found. Since paralogs are excluded, the number of sequences used in this study is smaller than that used 
in other studies [7-9] . The amino acid residue sequences of the remaining 7 proteins are first aligned 
using ClustalW with default parameters [45] and then with manual intervention. Alignment of the 
nucleotide sequences are generated following the alignment of the protein sequences. A phylogenetic 
tree T is constructed using maximum likelihood method as implemented in the Paup method [46]. 
This tree T is then used by the Paml package, an implementation of the maximum likelihood method 
for estimating lo values [27]. In many cases, minor difference in the tree does not affect final results 
significantly [47, 48] . For each protein, we repeatedly estimate w twenty times using different initial lu 
value that is assigned to all amino acid sites. The initial w values range from 0.01 to 2.00, at an interval 
of 0.1. About 90% of the computation converges. For each protein, all different converged estimations 
among the twenty calculations give identical u> parameters at individual codon positions. 

Natural selection at protein folding nucleus. The estimation of site-specific w-values can 
uncover residues important for biological function, for structural stability, and potentially for folding 
kinetics. In this study we focus on the natural selection of folding nucleus residues which are identified 
by 4>- value experiments. An example for estimated w values is shown in Fig 1. 

We first examine the patterns of w-ratio of nonsynonymous vs. synonymous substitutions in the 
seven proteins. If folding nucleus residues are more conserved than other residues, selection pressure 
then must be correlated with the extent of participation in folding nucleus [9]. Following Larson et al, 
we examine directly the correlation of the 0- values and the w-values of characterized residues for each 
protein. This approach helps to circumvent the unavoidable arbitrariness in the assignment of the set 
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Figure 2: Participation in folding nucleus as measured by experimental 0-value and selective pressure as measured by 
w-value are poorly correlated. 



of folding nucleus residues [9, 13]. Residues with characterized 0- values for these proteins are obtained 
from references cited in [9]. Following [8], we exclude residues with <j) < —0.5 or <p > 1.5, and require all 
^-values to have standard deviation < 1.0, with the exception of protein U1A (lurn), where no data of 
standard deviations are provided. 

Among the set of residues with experimentally characterized ^-values, there is little correlation 
between </>-value and w-value (Fig 2). The R 2 values range between 0.0 to 0.22, and the two-sided 
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Figure 3: The weighted mean value u = J2 m =iPm 1 of estimated w ratio at each residue position of the proteins. 
The x-axis shows the residue number of the protein, the y-axis shows the estimated Co at each residue position. The 
horizontal line marks the median Q value of all positions. Folding nucleus residues as identified in [7] are marked by 
Except protein CheY, randomization tests show that folding nucleus residues are not more conserved than the rest of the 
protein, and in all protein cases (including CheY protein) are not more conserved than the residue at 50% quantile of all 
residues ranked by w. 

p- values of f-test for the null hypothesis that the slope of the linear regression models is range from 
9% to 99% (Table I). That is, there is no indication of significant correlation between the extent of 
kinetic participation as measured by 0-value and selection pressure as measured by w-value. Our results 
are similar to those found in [8,9], where relative entropy instead of co was used as the measure of 
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Table I: The conservation and packing of folding nucleus residues. Np ro f. number of residues in the protein sequence; 
TVseq: number of sequences; A^: number of residues with 0-value measured, (a) Correlation of participation in folding 
nucleus as measured by </>-value and selection pressure as measured by u. R 2 '. the fraction of variance in the data that 
can be explained by the linear regression model; p: the two-sided p-value oft-test for the null hypothesis that the slope of 
the linear regression models is 0. (b) Randomization tests for assessing statistical significance of conservation of folding 
nucleus residues. The median u> value of the folding nucleus is tested against the distribution of the median ui value from 
10 5 random samples containing the same number of amino acid residues as that of the folding nucleus drawn from the 
same protein. p^\\: the p-value that the folding nucleus residues are more conserved than all other residues in the protein; 
^50%' ^ e P" va ' ue tnat folding nucleus residues are more conserved than the residue at 50% quantile of all residues ranked 
by u-value. (c) Packing analysis of the folding nucleus and of the whole protein. The average alpha coordination number 
Z a for all residues in the protein (Z a a ||) and for residues in the folding nucleus residues (Z Q f n ) are listed, except for 
structures determined by NMR techniques. Protein CheY has the highest Z a f n . 



evolutionary conservation. 

The weighted mean values of estimated oj ratio Cu = ^2 m p m ■ w TO at each codon position are plotted 
in Figure 3. It is clear that for each protein, many folding nucleus residues as defined in [7] have small 
values of u, many are often smaller than the median w-value of all codon positions. This indicates that 
folding nucleus residues experience purifying selection pressure. However, there are also many other 
residues with small u- value, some of which have not be characterized by 0- value studies. As discussed 
in [9], the lower w-values of folding nucleus as defined in [7] residues could also be a reflection of the 
experimental bias in choosing conserved protein core residues for Rvalue experiments. Can we still 
conclude that experimentally identified folding nucleus residues in general are more conserved than the 
rest of the protein? 

We use a randomization test following the approach first developed in [7] to address this question. 
The null hypothesis H is that nucleus residues have equal or greater median u values than that of the 
whole protein. That is, folding nucleus residues are no more conserved than the whole protein sequence. 
The alternative hypothesis H a is that folding nucleus residues have less median ui values than the whole 
protein sequence and are evolutionarily more conserved. We calculate the median of u values of the 
nucleus residues as defined in [7] , and compare it with the distribution of median of uu value in random 
samples containing the same number of residues drawn from the same protein. As in [7] , we use a sample 
size of 10 5 . The fraction of the random samples with median uo value smaller than that of the folding 
nucleus provides the p- value that the observed median w-values of the folding nucleus is due to random 
chance. Similar to [7], we use the threshold of p = 2% to decide whether evolutionary conservation of the 
folding nucleus is statistically significant. Table I shows that p-value ranges between 0.26% (CheY) and 
43% (ADA2h), but the majority are between 4.0% (FKBP12) and 43% (ADA2h). With the exception 
of CheY, the null hypothesis cannot be rejected with statistical significance at the confidence level of 
p < 2%. That is, except CheY, folding nuclei as defined in [7] in these proteins are not significantly 
more conserved than the rest of the protein. 

To further assess selection pressure on folding nucleus residues, we evaluate a different null hypothesis, 
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again using randomization test. The null hypothesis H now is that the folding nucleus residues have 
equal or greater median u-values than the residue with median w-value of the whole protein. That is, 
folding nucleus as defined in [7] are no more conserved than the residue halfway in the rank ordered list 
of all residues when sorted by estimated mean w-value. Table I shows that the p- values range from 8.2% 
to 99%. With the criterion of p < 2%, the null hypothesis cannot be rejected with statistical significance 
for any of the proteins. That is, folding nucleus for every protein studied here is not significantly more 
conserved than the residue with median u- value. 

Conservation of folding nucleus of CheY. CheY is the only protein among those studied here 
that may have a well-conserved folding nucleus based on results of the first readomization test. Correla- 
tion study of (j)- value and conservation measured by reduced entropy also suggested that CheY protein 
has a well-conserved folding nucleus [9]. What are the possible reasons for the strong conservation of 
folding nucleus in this protein? It was suggested earlier that tightly packed protein interior residues are 
well conserved and these are often part of the folding nucleus residues [4, 6, 49]. We use a parameter z a 
recently introduced in [50] to characterize protein local packing. z a is defined clS Zq = Tic /n, where n c is 
the number of non-bonding atomic alpha contacts between different residues, and n is the total number 
of atoms. Two atoms are in alpha contact if they are separated by a weighted Voronoi facet which 
intersects with the protein [50]. z a characterizes protein packing more faithfully than other parameters 
such as radius of gyration [50]. 

We calculate z a for the folding nucleus as defined in [7] and for the whole protein (Table I). We 
find that the folding nucleus of CheY has the highest z a value (3.60) compared to the folding nuclei of 
other proteins, whereas the whole protein z a value of CheY has similar values to other proteins. This 
indicates that the folding nucleus of CheY has significantly larger z a than the rest of CheY protein. 
The folding nucleus of CheY is packed tighter than folding nuclei in other proteins. This observation 
can intuitively explain the significant conservation in CheY: tight packing in this case is accompanied 
by little tolerance to mutation, since the lack of packing defects such as voids reduces the possibility for 
substitution of different amino acid residues. However, this is a rather tentative hypothesis. It is possible 
that very tightly packed residues are more conserved, independent of whether they are in folding nucleus 
or not. It is also possible that if results of additional experimental 0-value studies become available, the 
definition of the folding nucleus might change. To fully resolve the relationship of packing, folding, and 
evolutionary conservation, more detailed additional studies are required, which is beyond the scope of 
this work. 

Reconstructing ancestral folding nucleus. The approach used in this study can also suggest 
further experimental exploration of evolution history of protein folding dynamics. With the continuous 
time Markovian model, we can reconstruct likely candidate sequences of ancestral proteins at different 
evolutionary times. Specifically, identities of amino acid residues in the folding nucleus of ancient 
ancestral proteins can be postulated. 

As an example, we show the reconstructed residues of the folding nuclei of FKBP12 as defined in [7] 
in Figure 4. The five folding nucleus residues are VVVLVI in human FKBP12 protein. The first residue 
is L in some reconstructed ancestral genes, the second can be Y or N, the third can be L, and the sixth 
can be a V instead of I. Based on this simple analysis, an interesting quadruplet mutagenesis study can 
be suggested to experimentally test the folding dynamics of mutated FKBP12, where the folding nucleus 
is changed. The reconstructed ancient folding nuclei suggests a combination of residues represented by 
the pattern L{N,Y}LLV r V. Here {N,Y} means either a N or a Y residue is drawn. 

The fourth residue L and fifth residue Y in all ancestral genes are the same as that in human FKBP12, 
but inspection of sequences of other extant species show that the fourth residue can be any of I, P, or V, 
and the fifth can be any of I, V, L, and M. A further interesting experiment could be to test the folding 
behavior of 6-tuple mutants with folding nucleus formed by any combination of residues represented 
by the pattern L{N,Y}{I,P,V}{I,V,L,M}V. The recreated proteins then can be assayed for folding 
behavior, which can be compared with that of proteins present in extant organisms. Such experimental 
palaeobiochcmistry was already envisioned by Pauling and Zuckcrkandl many years ago [51], and the 
number of such studies is rapidly growing [52-57]. An in-depth study on recreating the full sequence 
of ancestral proteins will require additional detailed analysis, including choosing the most appropriate 
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Figure 4: Reconstructed ancestral protein sequences of FKBP12 protein, (a). The relevant part of the phylogenetic tr 
for FKBP12 is shown. Human FKBP12 protein from which experimental data were obtained is shown in shadow, (t 
Multiple alignment of DNA sequences of the folding nucleus of FKBP12 protein, including those of reconstructed foldi 
nucleus of ancestral proteins, (c). Multiple alignment of translated amino acid residue of the folding nucleus residu 
identified by </>-value studies (highlighted) and flanking residues. 



detailed evolutionary model [58-60] . 

Discussion. Although folding nucleus is under purifying pressure, we fail to observe significant 
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conservation for protein folding nucleus residues. Despite concerns raised in [9] about the specific choices 
of the data in [7], we use exactly the same set of proteins, the same definition of nuclei residues, and 
follow the same radomizatin test as that of [7]. It is possible that this would bias our study towards 
reproducing the results of [7]. Nevertheless, our results are similar to that of Plaxco et al and Larson 
et al [8,9], and are different from that of Mirny and Shakhnovich [7]. The different conclusion of this 
study and that of [7] is likely due to the different evolutionary models employed, namely, the difference 
between a DNA-codon based continuous-time Markov model vs. an implicit evolution model implied by 
entropy calculation. The conclusion that folding nuclei residues are not conserved will likely to remain if 
we were to use the data set and the definitions of folding nuclei from reference [9] . Experimental studies 
in barnase, SH3 domain, chymotrypsin inhibitor 2 suggest that the folding nucleus observed in wild 
type protein may not be indispensable, and alternative folding nucleus may arise if residues are mutated 
[61-65]. Another experimental example is Im9 and Im7 proteins. They are E colicin-binding immunity 
proteins that are of the same fold with about 60% sequence identity. The folding of Im9 and Im7 are two- 
state and three-state process, respectively. Although these two proteins have similar folding mechanism, 
(f)- value studies reveal that the kinetically important residues are different [66, 67]. This is consistent with 
recent simulation studies which suggest that evolution selection is more robust for residues important 
for stability than for kinetic accessibility [68,69]. In addition, the definition of a folding nucleus is 
arbitrary, because it is based on a threshold of 4> value {e.g., <f) > 0.5) [13]. An earlier study suggested 
that the critical nucleus may be as large as 10 2 residues, the size of a whole protein domain [70]. The 
non-uniqueness of folding nucleus was pointed out in a study using off-lattice model system [71]. The 
role of protein structure in folding is discussed from the viewpoint of small- world connections in [72]. 
Recent computational studies based on exact enumerable lattice models using master equation showed 
that there are remarkable heterogeneity in structural contacts underlying macroscopic two-state folding 
kinetics of model Go protein [20,21]. The kinetic barrier was shown to result from a reduced number 
of microroutes near the bottom of the folding funnel [20,21]. If these studies portray accurately the 
microscopic picture of the folding process, there are likely to be many different native contacts that form 
folding nuclei for different folding pathways in the free energy landscape. It is reasonable to expect that 
a large subset of residues are capable of providing critical native contacts, and these contacts vary for 
different microscopic folding pathways. The roles of these residues in folding are largely interchangeable, 
and this may be reflected in the lack of extraordinarily strong purifying selection pressure in the current 
set of folding nucleus residues characterized by cfy value studies. 

In summary, we use a continuous time Markovian model [25] and apply a maximum likelihood 
estimator developed in [27] to study the evolution of protein folding dynamics. We examine the coding 
DNA sequences rather than amino acid residue sequences, and assess selection pressure by estimating 
the ratio u> of nonsynonymous vs. synonymous codon substitution rate. The position specific rate ratio 
is used to distinguish substitutions fixed by evolution and by chance. We found that folding nucleus 
residues experience purifying selection pressure, but they are not significantly more conserved than the 
rest of the residues of the whole protein. The only exception is CheY protein, where the folding nucleus is 
significantly more conserved. This may be due to extraordinarily tight packing, which is reflected by the 
high alpha coordination number Z a . Results described here provides another confirmation that evolution 
does not preserve kinetically important residues, which has been a subject of debate in literature [7— 
9]. We further suggest exploratory palaeobiochemical studies testing the evolution of protein folding 
dynamics. 
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